A new feature selection algorithm based on binomial hypothesis testing for spam filtering

Authors

  • Jieming Yang
  • Yuanning Liu
  • Zhen Liu
  • Xiaodong Zhu
  • Xiaoxu Zhang
Abstract

Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable step in text categorization, therefore also plays an important role in improving spam filtering performance. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability that a feature belongs to spam satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010) using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, χ²-statistic, improved Gini index and Poisson distribution). The experiments show that, in terms of the F1 measure, Bi-Test performs significantly better than the χ²-statistic and Poisson distribution and is comparable to information gain and the improved Gini index when the Naïve Bayes classifier is used; it is comparable to the other methods when the SVM classifier is used. Moreover, Bi-Test executes faster than the other four algorithms. © 2011 Elsevier B.V. All rights reserved.
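The abstract does not give the exact Bi-Test statistic, so the following is only a minimal sketch of the general idea, under stated assumptions: for each term, count the documents that contain it and how many of those are spam, then run a one-sided binomial test of whether the spam proportion exceeds a threshold. The function names, the threshold `p0`, and the significance level `alpha` are hypothetical choices for illustration and are not taken from the paper.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def select_features(docs, labels, p0=0.5, alpha=0.05):
    """Keep terms whose spam proportion is significantly above p0.

    docs   : list of token lists, one per message
    labels : list of booleans, True for spam
    p0     : assumed threshold on P(spam | term) under the null hypothesis
    alpha  : assumed significance level
    Returns a dict mapping each selected term to its p-value.
    """
    totals = {}       # number of documents containing each term
    spam_counts = {}  # number of spam documents containing each term
    for tokens, is_spam in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            totals[term] = totals.get(term, 0) + 1
            if is_spam:
                spam_counts[term] = spam_counts.get(term, 0) + 1

    selected = {}
    for term, n in totals.items():
        k = spam_counts.get(term, 0)
        # One-sided test: H0 is p <= p0, H1 is p > p0.
        p_value = binomial_upper_tail(k, n, p0)
        if p_value <= alpha:
            selected[term] = p_value
    return selected
```

Calling `select_features(docs, labels)` on a tokenized corpus returns only the terms that look spam-indicative under this test. A symmetric test for ham-indicative terms, or ranking all terms by p-value and keeping the top few, would be equally plausible variants.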

Similar resources

A Novel Hybrid Approach for Email Spam Detection based on Scatter Search Algorithm and K-Nearest Neighbors

Because cyberspace and the Internet predominate in users' lives, they bring not only business opportunities and time savings but also threats to hardware and software, such as information theft and penetration into systems. Security is the top priority in preventing a cyber-attack, and users should first detect the type of attack, because virtual environments are not moni...

A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization

Spam is unwanted email that is harmful to communications around the world. It is a growing problem in personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...

The study on the spam filtering technology based on Bayesian algorithm

This paper analyzed spam filtering technology, carried out a detailed study of Naive Bayes algorithm, and proposed the improved Naive Bayesian mail filtering technology. Improvement can be seen in text selection as well as feature extraction. The general Bayesian text classification algorithm mostly takes information gain and cross-entropy algorithm in feature selection. Through the principle o...

A New Method for Characterization of Biological Particles in Microscopic Videos: Hypothesis Testing Based on a Combination of Stochastic Modeling and Graph Theory

Introduction: Studying the motility of biological objects is important in many biomedical processes. Automated methods for analyzing microscopic videos are therefore becoming an important step in recent research. Materials and Methods: In the proposed method of this article, a hypothesis testing function is defined to separate biological particles from artifact and noise in captured v...

Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets

Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...

Journal:
  • Knowl.-Based Syst.

Volume 24

Publication date: 2011